Learning goals

Lab description

For this lab we will be dealing with the meteorological dataset met. We will use data.table to answer some questions about the met dataset while practicing your Git+GitHub skills for this project.

This markdown document should be rendered using the github_document output format.

Part 1: Setup a Git project and the GitHub repository

  1. Go to wherever you are planning to store the data on your computer, and create a folder for this project

  2. In that folder, save this template as “README.Rmd”. This will be the markdown file where all the magic will happen.

  3. Go to your GitHub account and create a new repository with the same name as your local folder, e.g., “JSC370-labs”.

  4. Initialize the Git project, add the “README.Rmd” file, and make your first commit.

  5. Add the repo you just created on GitHub.com to the list of remotes, and push your commit to origin while setting the upstream.

Most of the steps can be done using the command line:

# Step 1
cd ~/Documents
mkdir JSC370-labs
cd JSC370-labs

# Step 2
wget https://raw.githubusercontent.com/JSC370/jsc370-2023/main/labs/lab05/lab05-wrangling-gam.Rmd
mv lab05-wrangling-gam.Rmd README.Rmd
# if wget is not available,
curl https://raw.githubusercontent.com/JSC370/jsc370-2023/main/labs/lab05/lab05-wrangling-gam.Rmd --output README.Rmd

# Step 3
# Happens on github

# Step 4
git init
git add README.Rmd
git commit -m "First commit"

# Step 5
git remote add origin git@github.com:[username]/JSC370-labs
git push -u origin master

You can also complete the steps in R (replace with your paths/username where needed):

# Step 1
setwd("~/Documents")
dir.create("JSC370-labs")
setwd("JSC370-labs")

# Step 2
download.file(
  "https://raw.githubusercontent.com/JSC370/jsc370-2023/main/labs/lab05/lab05-wrangling-gam.Rmd",
  destfile = "README.Rmd"
  )

# Step 3: Happens on Github

# Step 4
system("git init && git add README.Rmd")
system('git commit -m "First commit"')

# Step 5
system("git remote add origin git@github.com:[username]/JSC370-labs")
system("git push -u origin master")

Once you are done setting up the project, you can start working with the MET data.

Setup in R

  1. Load the data.table package (along with dtplyr and dplyr).
library(data.table)
library(dtplyr)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:data.table':
## 
##     between, first, last
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
  2. Load the met data from https://raw.githubusercontent.com/JSC370/JSC370-2024/main/data/met_all_2023.gz, and also the station data. For the latter, you can use the code we used during lecture to pre-process the stations data:
# Download the data
stations <- fread("ftp://ftp.ncdc.noaa.gov/pub/data/noaa/isd-history.csv")
stations[, USAF := as.integer(USAF)]
## Warning in eval(jsub, SDenv, parent.frame()): NAs introduced by coercion
# Dealing with NAs and 999999
stations[, USAF   := fifelse(USAF == 999999, NA_integer_, USAF)]
stations[, CTRY   := fifelse(CTRY == "", NA_character_, CTRY)]
stations[, STATE  := fifelse(STATE == "", NA_character_, STATE)]

# Selecting the three relevant columns, and keeping unique records
stations <- unique(stations[, list(USAF, CTRY, STATE)])

# Dropping NAs
stations <- stations[!is.na(USAF)]

# Removing duplicates
stations[, n := 1:.N, by = .(USAF)]
stations <- stations[n == 1,][, n := NULL]

# Read in the met data
download.file(
  "https://raw.githubusercontent.com/JSC370/JSC370-2025/main/data/met/met_all.gz",
  destfile = "met_all.gz",
  method   = "curl",
  timeout  = 60
  )

met <- data.table::fread("met_all.gz")
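Before merging, a quick look at the dimensions and the key variables can confirm the download and read-in worked. This is a minimal sanity check, not part of the required analysis:

```r
# Quick sanity check on the freshly loaded met data
dim(met)
summary(met[, .(temp, wind.sp, atm.press)])
```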
  3. Merge the data as we did during the lecture. Use merge(); you can also try the tidy way with left_join().
met <- merge(x = met, y = stations, by.x = "USAFID", by.y = "USAF", all.x = TRUE, all.y = TRUE)
#met <- left_join(x = met, y = stations, by = c("USAFID" = "USAF"))
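A quick check after the merge can catch join problems early. The sketch below (assuming the merged met object from the line above) counts records that failed to match a station entry, which show up as NA in STATE:

```r
# Count records with no matching station metadata after the join
met[, .(
  n_records  = .N,
  n_no_state = sum(is.na(STATE)),
  n_stations = uniqueN(USAFID)
)]
```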

Question 1: Representative station for the US

Across all weather stations, which stations have the median values of temperature, wind speed, and atmospheric pressure? Use the quantile() function to identify these three stations. Do they coincide?

# find median values
medians <- met[, .(
  temp_50 = quantile(temp, probs = .5, na.rm = TRUE),
  wind.sp_50 = quantile(wind.sp, probs = .5, na.rm = TRUE),
  atm.press_50 = quantile(atm.press, probs = .5, na.rm = TRUE)
)]

medians
##    temp_50 wind.sp_50 atm.press_50
##      <num>      <num>        <num>
## 1:    23.5        2.1       1014.1
station_med <- met[, .(
  temp = quantile(temp, probs = .5, na.rm = TRUE),
  wind.sp = quantile(wind.sp, probs = .5, na.rm = TRUE),
  atm.press = quantile(atm.press, probs = .5, na.rm = TRUE)
), by = .(USAFID, STATE)]
station_med[, temp_dist := abs(temp-medians$temp_50)]
median_temp_station <- station_med[temp_dist == 0]
median_temp_station
##    USAFID  STATE  temp wind.sp atm.press temp_dist
##     <int> <char> <num>   <num>     <num>     <num>
## 1: 720501     VA  23.5     1.5        NA         0
## 2: 722031     AL  23.5     0.0        NA         0
## 3: 722148     NC  23.5     0.0        NA         0
## 4: 723055     NC  23.5     0.0        NA         0
## 5: 723067     NC  23.5     1.5        NA         0
## 6: 723177     NC  23.5     0.0        NA         0
## 7: 725564     NE  23.5     2.6        NA         0
station_med[, wind.sp_dist := abs(wind.sp-medians$wind.sp_50)]
median_wind.sp_station <- station_med[wind.sp_dist == 0]
median_wind.sp_station
##      USAFID  STATE  temp wind.sp atm.press temp_dist wind.sp_dist
##       <int> <char> <num>   <num>     <num>     <num>        <num>
##   1: 720110     TX  31.0     2.1        NA       7.5            0
##   2: 720258     MN  17.0     2.1        NA       6.5            0
##   3: 720266     IN  21.0     2.1        NA       2.5            0
##   4: 720272     WA  18.0     2.1        NA       5.5            0
##   5: 720273     TX  28.6     2.1        NA       5.1            0
##  ---                                                             
## 339: 726583     MN  21.0     2.1        NA       2.5            0
## 340: 726589     MN  20.0     2.1        NA       3.5            0
## 341: 726603     MN  20.7     2.1        NA       2.8            0
## 342: 726626     WI  16.6     2.1        NA       6.9            0
## 343: 726813     ID  22.8     2.1   1011.75       0.7            0
station_med[, atm.press_dist := abs(atm.press-medians$atm.press_50)]
median_atm.press_station <- station_med[atm.press_dist == 0]
median_atm.press_station
##     USAFID  STATE  temp wind.sp atm.press temp_dist wind.sp_dist atm.press_dist
##      <int> <char> <num>   <num>     <num>     <num>        <num>          <num>
##  1: 722420     TX  30.0     4.6    1014.1       6.5          2.5              0
##  2: 723830     CA  23.3     5.1    1014.1       0.2          3.0              0
##  3: 724885     NV  24.7     2.6    1014.1       1.2          0.5              0
##  4: 724940     CA  18.9     5.1    1014.1       4.6          3.0              0
##  5: 725376     MI  22.8     3.1    1014.1       0.7          1.0              0
##  6: 725975     OR  16.1     2.1    1014.1       7.4          0.0              0
##  7: 726183     ME  18.9     0.0    1014.1       4.6          2.1              0
##  8: 726375     MI  21.1     3.1    1014.1       2.4          1.0              0
##  9: 726579     MN  20.0     3.1    1014.1       3.5          1.0              0
## 10: 726584     MN  20.0     3.1    1014.1       3.5          1.0              0
## 11: 726590     SD  20.0     3.1    1014.1       3.5          1.0              0

Knit the document, commit your changes, and save it on GitHub. Don’t forget to add README.md to the tree the first time you render it.

Question 2: Representative station per state

Just like the previous question, you are asked to identify the most representative (median) station per state. This time, instead of looking at one variable at a time, use the Euclidean distance. If multiple stations tie at the median, select the one located at the lowest latitude.

station_med[, temp_50 := quantile(temp, probs = .5, na.rm = TRUE), by = STATE]
station_med[, wind.sp_50 := quantile(wind.sp, probs = .5, na.rm = TRUE), by = STATE]

station_med[, eudist := sqrt(
  (temp - temp_50)^2 + (wind.sp - wind.sp_50)^2
)]

id_station <- station_med[, .SD[which.min(eudist)], by = STATE]

id_station <- merge(
  x = id_station, y = stations,
  by.x = "USAFID", by.y = "USAF",
  all.x = TRUE, all.y = FALSE
)
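Note that which.min() breaks ties implicitly by returning the first minimum. A sketch of the lowest-latitude tie-break the question asks for might look like the following, where station_med_geo is a hypothetical version of station_med with a lat column merged in (e.g., from the met coordinates):

```r
# Hypothetical: station_med_geo is station_med with a lat column merged in.
# Order by distance, then latitude, so ties at the minimum distance
# resolve to the station with the lowest latitude.
id_station <- station_med_geo[order(eudist, lat), .SD[1], by = STATE]
```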

Knit the doc and save it on GitHub.

Question 3: In the middle?

For each state, identify the station closest to the midpoint of the state. Combining these with the stations you identified in the previous question, use leaflet() to visualize all ~100 points in the same figure, applying different colors to those identified in this question.

# Calculate the midpoint (median latitude and longitude) per state
mid_point <- met[, .(
  lat_50 = quantile(lat, probs = .5, na.rm = TRUE),
  lon_50 = quantile(lon, probs = .5, na.rm = TRUE)
), by = STATE]

mid <- merge(x = met, y = mid_point, by = "STATE")
# Compute Euclidean distance to the midpoint
mid[, mid_eudist := sqrt(
  (lon - lon_50)^2 + (lat - lat_50)^2
)]

# Select the closest station to the midpoint
mid_station <- mid[, .SD[which.min(mid_eudist)], by = STATE]

# Use one set of coordinates per station to avoid duplicating rows in the join
id_station <- merge(id_station, unique(met[, .(USAFID, lat, lon)], by = "USAFID"), by = "USAFID", all.x = TRUE)

# Load leaflet package
library(leaflet)

leaflet() %>%
  addProviderTiles('CartoDB.Positron') %>%
  addCircles(
    data = mid_station,
    lat = ~lat, lng = ~lon, popup = ~paste("Midpoint Station:", STATE),
    opacity = 1, fillOpacity = 1, radius = 400, color = "blue"
  ) %>%
  addCircles(
    data = id_station,
    lat = ~lat, lng = ~lon, popup = ~paste("Representative Station:", STATE.x),
    opacity = 1, fillOpacity = 1, radius = 400, color = "magenta"
  )

Knit the doc and save it on GitHub.

Question 4: Means of means

Using the quantile() function, generate a summary table that shows the number of states included, average temperature, wind-speed, and atmospheric pressure by the variable “average temperature level,” which you’ll need to create.

Start by computing the states’ average temperature. Use that measurement to classify them according to the following criteria:

  • Low: temp < 20
  • Mid: temp >= 20 and temp < 25
  • High: temp >= 25
met[, elev_cat := fifelse(elev < 90, "low-elev", "high-elev")]
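The elevation split above is one grouping; the temperature levels described in the bullet points could be created in the same fifelse() style. A sketch, where state_avg and avg_temp_level are names introduced here for illustration:

```r
# State-level average temperature, then the three-level classification
state_avg <- met[, .(temp_mean = mean(temp, na.rm = TRUE)), by = STATE]
state_avg[, avg_temp_level := fifelse(
  temp_mean < 20, "low",
  fifelse(temp_mean < 25, "mid", "high")
)]
```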

Once you are done with that, you can compute the following:

  • Number of entries (records),
  • Number of NA entries,
  • Number of stations,
  • Number of states included, and
  • Mean temperature, wind-speed, and atmospheric pressure.

All by the levels described before.
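The summary described above could be sketched in data.table as follows, assuming the state-level classification has been merged back into met as a hypothetical avg_temp_level column:

```r
# Hypothetical: met carries an avg_temp_level column merged in from the
# state-level classification
met[!is.na(avg_temp_level), .(
  n_records  = .N,
  n_na_temp  = sum(is.na(temp)),
  n_stations = uniqueN(USAFID),
  n_states   = uniqueN(STATE),
  temp_mean      = mean(temp, na.rm = TRUE),
  wind.sp_mean   = mean(wind.sp, na.rm = TRUE),
  atm.press_mean = mean(atm.press, na.rm = TRUE)
), by = avg_temp_level]
```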

library(tidyr)
summary_table <- met |>
  group_by(STATE, elev_cat) |>
  summarize(temp_mean = mean(temp, na.rm = TRUE)) |>
  pivot_wider(names_from = elev_cat, values_from = temp_mean)
## `summarise()` has grouped output by 'STATE'. You can override using the
## `.groups` argument.
#summary_table <- summary_table |>
#  mutate(avg_temp_level = case_when(
#    temp_mean < 20 ~ "low",
#    temp_mean >= 20 & temp_mean < 25 ~ "mid",
#    temp_mean >= 25 ~ "high"
#  ))
# Install and load kableExtra
library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
# Generate the table
kable(summary_table, booktabs = TRUE) %>%
  kable_styling(font_size = 10) %>%
  kable_paper("hover", full_width = FALSE)
STATE NA high-elev low-elev
AK NaN NA NA
AL NaN 25.92562 26.90432
AR NaN 25.71858 26.87350
AS NaN NA NA
AZ NaN 28.80596 NA
BC NaN NA NA
CA NaN 23.72283 21.13167
CO NaN 19.54725 NA
CR NaN NA NA
CT NaN 21.81456 22.50812
DE NaN NA 24.58116
FL NaN NaN 27.53747
FM NaN NA NA
GA NaN 26.35009 26.81120
GU NaN NA NA
HI NaN NA NA
IA NaN 21.27773 NA
ID NaN 20.69554 NA
IL NaN 22.41005 NA
IN NaN 21.76562 NA
KS NaN 24.25538 NA
KY NaN 23.87157 NA
LA NaN 27.97857 27.97381
MA NaN 20.37799 21.74306
MB NaN NA NA
MD NaN 23.47545 25.19608
ME NaN 18.32004 19.26441
MH NaN NA NA
MI NaN 20.19981 NA
MN NaN 19.31893 20.91976
MO NaN 23.87039 NA
MP NaN NA NA
MS NaN 26.04596 26.83332
MT NaN 18.16680 NA
NB NaN NA NA
NC NaN 23.51121 25.41548
ND NaN 18.37173 NA
NE NaN 22.10408 NA
NH NaN 17.98781 20.88998
NJ NaN 21.59745 23.43003
NM NaN 24.47771 NA
NS NaN NA NA
NT NaN NA NA
NU NaN NA NA
NV NaN 26.04296 NA
NY NaN 19.31104 21.98619
OH NaN 21.83450 NA
OK NaN 27.40891 NA
ON NaN NA NA
OR NaN 19.10970 17.16329
PA NaN 21.46292 25.00705
PC NaN NA NA
PR NaN NA NA
PW NaN NA NA
QC NaN NA NA
RI NaN 21.02958 22.70043
SC NaN 25.25343 26.32267
SD NaN 20.03650 NA
TN NaN 24.74959 27.53806
TX NaN 29.52913 29.80697
UM NaN NA NA
UT NaN 25.82056 NA
VA NaN 22.94130 24.96217
VI NaN NA NA
VT NaN 18.34464 21.10825
WA NaN 19.35326 18.98941
WI NaN 18.57907 NA
WV NaN 21.74214 NA
WY NaN 18.60170 NA
YK NaN NA NA
YT NaN NA NA
NA NaN NA NA

Knit the document, commit your changes, and push them to GitHub.

Question 5: Advanced Regression

Let’s practice running regression models with smooth functions on X. We need the mgcv package and its gam() function to do this.

  • Using your data with the median values per station, examine the association between median temperature (y) and median wind speed (x). Create a scatterplot of the two variables using ggplot2. Add both a linear regression line and a smooth line.

  • Fit both a linear model and a spline model (use gam() with a cubic regression spline on wind speed). Summarize and plot the results from the models and interpret which model is the best fit and why.
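The analysis below uses atm.press as the predictor; a sketch of the wind-speed version the bullet points describe, using a cubic regression spline on wind.sp with the station_med table from Question 1, could be:

```r
library(mgcv)
library(ggplot2)

# Scatterplot with linear and smooth fits, wind speed on x
ggplot(station_med, aes(x = wind.sp, y = temp)) +
  geom_point() +
  geom_smooth(method = "lm", col = "red") +
  geom_smooth(method = "gam", formula = y ~ s(x, bs = "cr"), col = "blue")

# Linear model vs. cubic regression spline on wind speed
lm_wind  <- lm(temp ~ wind.sp, data = station_med)
gam_wind <- gam(temp ~ s(wind.sp, bs = "cr"), data = station_med)
summary(lm_wind)
summary(gam_wind)
```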

station_med_lt <- lazy_dt(station_med)
station_med_lt <- station_med_lt |>
  filter(between(atm.press, 1000, 1020)) |>
  collect()

library(ggplot2)
ggplot(station_med_lt, aes(x = atm.press, y = temp)) +
  geom_point() +
  geom_smooth(method = "lm", col = "cyan") +
  geom_smooth(method = "gam", col = "blue")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ s(x, bs = "cs")'

# Linear model with temp as the outcome and atm.press as the predictor
lm_mod <- lm(temp ~ atm.press, data = station_med_lt)

summary(lm_mod)
## 
## Call:
## lm(formula = temp ~ atm.press, data = station_med_lt)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -14.928  -2.390   0.044   2.525   8.323 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1041.37307   65.49707   15.90   <2e-16 ***
## atm.press     -1.00374    0.06459  -15.54   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.33 on 898 degrees of freedom
## Multiple R-squared:  0.212,  Adjusted R-squared:  0.2111 
## F-statistic: 241.5 on 1 and 898 DF,  p-value: < 2.2e-16
library(mgcv)
## Loading required package: nlme
## 
## Attaching package: 'nlme'
## The following object is masked from 'package:dplyr':
## 
##     collapse
## This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
# bs = "cr": cubic regression spline; k = 20 sets the basis dimension
gam_mod <- gam(temp ~ s(atm.press, bs = "cr", k = 20), data = station_med_lt)
summary(gam_mod)
## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## temp ~ s(atm.press, bs = "cr", k = 20)
## 
## Parametric coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  23.4852     0.1051   223.4   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df     F p-value    
## s(atm.press) 9.861  11.83 31.76  <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.293   Deviance explained = 30.1%
## GCV = 10.064  Scale est. = 9.9425    n = 900
plot(gam_mod)